Phase transition in the detection of modules in sparse networks 

We present an asymptotically exact analysis of the problem of detecting communities in sparse
random networks. Our results are also applicable to detection of functional modules, partitions, and 
colorings in noisy planted models. Using a cavity method analysis, we unveil a phase transition from 
a region where the original group assignment is undetectable to one where detection is possible. In 
some cases, the detectable region splits into an algorithmically hard region and an easy one. Our 
approach naturally translates into a practical algorithm for detecting modules in sparse networks, 
and learning the parameters of the underlying model. 

In many networks, ranging from online communities to
food webs, metabolic networks, and genetic regulatory 
networks, there are communities or modules that play 
distinct functional roles. A fundamental problem is to 
detect these communities and understand what role they 
play in the network's structure and dynamics. In social 
networks, these communities are often assortative, mean- 
ing that there is a higher density of connections within 
communities than between them, and many approaches 
to detecting these communities have been proposed (see 
e.g. [1]). In other networks, however, these modules may
consist of nodes with few connections to each other, but 
which connect to the rest of the network in similar ways. 

In this Letter we analyze a random generative model 
for sparse modular networks, known as the stochastic 
block model. It provides a useful playground for theoret- 
ical ideas and the analysis of algorithms, and is a popular 
model for functional modules in real networks. Using the 
cavity method developed in the physics of disordered
systems [2, 3], we exactly analyze the detectability of these
modules in the limit of large sparse networks. As a func- 
tion of the parameters, we compute the phase diagram 
and locate the associated phase transitions. 

We distinguish between a detectable phase where it is 
possible to learn the model's parameters and the group 
assignments of the nodes, and a non-intuitive unde- 
tectable phase where learning is impossible because the 
network's topology does not retain enough information 
about the original group memberships. The existence of 
a phase where a certain class of algorithms is unable to 
detect communities was previously predicted but its 
location was only found approximately (and its size over- 
estimated). In addition, unlike previous works based on 
finding a ground state, i.e., minimizing a cost function 
associated with a group assignment [4, 5], our analysis is
more general as it relies on the properties of the entire 
Boltzmann distribution of group assignments. 



We also unveil a transition from an algorithmically 
"hard" phase, where, we believe, no polynomial algo- 
rithm for learning the groups and parameters exists, to 
an "easy" phase where polynomial algorithms do exist. 
In the latter phase, we show that Belief Propagation 
(BP) [6] works on large networks in essentially linear time 
as a function of their size. BP was previously proposed 
for community detection [7] without, however, the ability
to learn the parameters of the underlying model. 

Our approach also provides a natural measure of the 
significance of the modules in the network, since it com- 
putes the marginal probability that a given node belongs 
to a given group. If the network does not contain any 
modules, our method correctly infers this fact by making 
these marginals uniform. This is an aspect missing in the 
vast majority of the present approaches to community 
detection. Our theoretical understanding and algorithm 
are also applicable to real world networks, as we discuss 
briefly at the end of this paper (and in detail elsewhere). 
Moreover our approach is not restricted by the details of 
the generative model, and is easily generalized to more 
elaborate models (e.g., those of [8]). 

Stochastic block models. We consider networks of N 
nodes. Each node $i$ has a hidden label $t_i \in \{1, \dots, q\}$,
specifying which of $q$ groups it is a member of. These labels
are chosen independently, where $n_a$ is the probability
that a given node has label $a \in \{1, \dots, q\}$ (normalized so
that $\sum_{a=1}^{q} n_a = 1$). If $N_a$ is the number of nodes in
group $a$, we have $n_a = \lim_{N \to \infty} N_a / N$.

Once the group assignment is chosen, the model gener- 
ates a graph G as follows. For each pair of nodes i, j with 
i < j, we put an edge between i and j independently with 
probability $p_{t_i,t_j}$, leaving them unconnected with probability
$1 - p_{t_i,t_j}$. We call $p_{ab}$ the affinity matrix. Since
we are interested in the sparse case where $p_{ab} = O(1/N)$,
we will use the rescaled affinity matrix $c_{ab} = N p_{ab}$ and
assume that $c_{ab} = O(1)$ in the limit $N \to \infty$.
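
The generative process just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `sample_sbm` and `planted_partition` are hypothetical, and the double loop over pairs costs O(N^2) time, which is acceptable for small N but is not how one would sample large sparse graphs.

```python
import random

def sample_sbm(N, n, c, seed=0):
    """Sample hidden labels and an edge list from a sparse stochastic block model.

    N : number of nodes
    n : list of group probabilities n_a (assumed to sum to 1)
    c : q x q symmetric rescaled affinity matrix, so that p_ab = c_ab / N
    """
    rng = random.Random(seed)
    q = len(n)
    # Hidden labels t_i, drawn independently with probabilities n_a.
    t = rng.choices(range(q), weights=n, k=N)
    edges = []
    # Each pair i < j is connected independently with probability c[t_i][t_j] / N.
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < c[t[i]][t[j]] / N:
                edges.append((i, j))
    return t, edges

def planted_partition(q, c_in, c_out):
    """Affinity matrix with c_aa = c_in and c_ab = c_out for a != b."""
    return [[c_in if a == b else c_out for b in range(q)] for a in range(q)]

if __name__ == "__main__":
    q, N = 2, 1000
    c = planted_partition(q, c_in=5.0, c_out=1.0)  # average degree (5 + 1) / 2 = 3
    t, edges = sample_sbm(N, [1.0 / q] * q, c)
    print("empirical average degree:", 2 * len(edges) / N)
```

The planted coloring case corresponds to `planted_partition(q, 0.0, c_out)`.
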

In our setting, the adjacency matrix $A_{ij}$ of the graph
is the only information available to us. Our goal is to
learn the parameters $q$, $\{n_a\}$, $\{p_{ab}\}$ of the block model,
as well as the true group assignments $\{t_i\}$. Special cases
of this model have often been considered in the literature.
Planted partitioning, where $n_a = 1/q$, $c_{ab} = c_{\mathrm{out}}$ for $a \neq b$,
and $c_{aa} = c_{\mathrm{in}}$ with $c_{\mathrm{in}} > c_{\mathrm{out}}$, is a classical problem in
computer science and has been used as a benchmark for
community detection a a a a fioj j . Planted coloring, 
where $n_a = 1/q$, $c_{aa} = 0$, and $c_{ab} = cq/(q - 1)$, is a
fundamental problem in constraint optimization [3], and
was studied using the cavity method in [TO ]. 

Bayesian inference for block models. Bayesian inference
has been applied to community detection before.
However, except for some very specific generative
models [12], the likelihood function must be computed
approximately, either through Monte Carlo sampling
(e.g. [13]) or variational methods [10]. The crucial
contribution of our work is that the quantities that
follow from Bayesian inference can be computed exactly 
in the thermodynamic limit using the cavity method, or 
on real finite networks using the BP algorithm in time 
roughly linear in the size of the network. The probabil- 
ity that the model parameters take a given set of values 
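$\{\theta\} = (q, \{n_a\}, \{c_{ab}\})$, conditioned on the topology of the
network $G$, is

$$P(\{\theta\} \mid G) = \frac{P(\{\theta\})}{P(G)} \sum_{\{t_i\}} P(G, \{t_i\} \mid \{\theta\}). \qquad (1)$$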



The sum is over all possible group assignments $\{t_i\}$,
where $t_i \in \{1, \dots, q\}$ for each node $i$. The prior $P(\{\theta\})$
includes all graph-independent information about the
values of the parameters. We will assume there is no
such information available and hence this prior is uniform.
In that case, maximizing $P(\{\theta\} \mid G)$ over $\{\theta\}$ is
equivalent to maximizing the sum $\sum_{\{t_i\}} P(G, \{t_i\} \mid \{\theta\})$.
The function $P(G, \{t_i\} \mid \{\theta\})$ is called the likelihood.
It is the probability that the model would produce the
group assignment $\{t_i\}$ and the network $G$, assuming
that its parameters are $\{\theta\}$. We can write the likelihood
exactly for many different generative models; for
the stochastic block model defined above, it is



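$$P(G, \{t_i\} \mid \{\theta\}) = \prod_{i} n_{t_i} \prod_{i<j} p_{t_i,t_j}^{A_{ij}} \left(1 - p_{t_i,t_j}\right)^{1 - A_{ij}}.$$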



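Thus $P(\{\theta\} \mid G)$ is proportional to the partition sum
$Z(\{\theta\})$ of a generalized Potts model, with Hamiltonian

$$H(\{t_i\} \mid \{\theta\}) = -\sum_{i} \log n_{t_i} - \sum_{i<j} \left[ A_{ij} \log c_{t_i,t_j} + (1 - A_{ij}) \log\left(1 - \frac{c_{t_i,t_j}}{N}\right) \right]. \qquad (2)$$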



There is a strong $O(1)$ interaction between connected
nodes, and a weak $O(1/N)$ one between unconnected
nodes. The $\log n_{t_i}$ play the role of local fields, enforcing
the prior distribution $\{n_a\}$ on group assignments.
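
To make the three contributions to the Hamiltonian (2) concrete, the following sketch evaluates the energy of one group assignment directly. It assumes 0-based labels, an edge list of node pairs, and that N is large enough for every c_ab / N to stay below 1; the function name `hamiltonian` is illustrative. The loop over all pairs is quadratic in N and is meant for checking small examples only.

```python
import math

def hamiltonian(t, edges, n, c):
    """Energy of eq. (2) for labels t, an edge list, priors n and affinities c."""
    N = len(t)
    # Local fields: -sum_i log n_{t_i}.
    H = -sum(math.log(n[a]) for a in t)
    edge_set = {(min(i, j), max(i, j)) for i, j in edges}
    for i in range(N):
        for j in range(i + 1, N):
            cij = c[t[i]][t[j]]
            if (i, j) in edge_set:
                if cij == 0.0:
                    return math.inf           # forbidden edge, e.g. planted coloring
                H -= math.log(cij)            # strong O(1) term on edges
            else:
                H -= math.log(1.0 - cij / N)  # weak O(1/N) term on non-edges
    return H
```
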

Inferring the parameters $\{\theta\}$ is equivalent to minimizing
the free energy $f(\{\theta\}) = -\log Z(\{\theta\})/N$ associated
with (2). If $f(\{\theta\})$ has a non-degenerate minimum, then,
from the saddle point method, the minimizing $\{\theta\}$ is with high
probability exactly the set of parameters used in the generation
of the network. In that case, inferring the parameters of
the underlying model is possible.
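
For very small graphs the free energy can be computed by brute force, which is a useful reference when checking approximate methods. The sketch below sums the Boltzmann weight of (2) over all q^N assignments; the name `free_energy_bruteforce` is hypothetical and the exponential cost restricts it to a handful of nodes.

```python
import math
from itertools import product

def free_energy_bruteforce(N, edges, n, c):
    """f = -log(Z) / N, with Z the sum of exp(-H) over all q^N assignments, H as in eq. (2)."""
    q = len(n)
    edge_set = {(min(i, j), max(i, j)) for i, j in edges}
    Z = 0.0
    for t in product(range(q), repeat=N):
        w = 1.0  # Boltzmann weight exp(-H) of this assignment
        for i in range(N):
            w *= n[t[i]]
            for j in range(i + 1, N):
                cij = c[t[i]][t[j]]
                w *= cij if (i, j) in edge_set else 1.0 - cij / N
        Z += w
    return -math.log(Z) / N
```

Scanning this function over a grid of candidate parameters and keeping the minimum is the brute-force analogue of the inference described above.
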

Assuming that we know, or have learned, the correct
parameters $\{\theta\}$, how should we determine the group
assignment of the nodes? The most likely assignment $\{t_i\}$
is the ground state of the Hamiltonian (2). However,
if we want to find an assignment $\{t_i\}$ that maximizes
the number of correctly labeled nodes, we need to follow
a different strategy. Namely, we should compute the
marginal distribution $\nu_i(t_i) = \sum_{\{t_j\}_{j \neq i}} \mu(\{t_k\})$ of
the label of each node $i$, where $\mu$ is the Boltzmann distribution
of (2). The most probable group assignment for
node $i$ is then $t_i^* = \mathrm{argmax}_{t_i} \nu_i(t_i)$.

It can be proven in general [14] that this marginalization
maximizes the number of correctly labeled nodes in
the thermodynamic limit, and that it is a better choice
than using the ground state of (2). Furthermore, a configuration
chosen according to the Boltzmann distribution
has, asymptotically, the correct group sizes and the
correct number of edges between each pair of groups,
while for the ground state this is not true; finding the
minimum bisection, for instance, creates the illusion of
two groups even in a completely random graph [15].
Moreover, marginalization is algorithmically easier than
searching for the ground state. The expected number
of correctly labeled nodes can be estimated as $\sum_i \nu_i(t_i^*)$,
even without knowing the original assignment.
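
The decoding rule and the self-estimate of its accuracy are simple enough to state in code. This sketch assumes the marginals are stored as one probability vector per node; the name `decode_marginals` and that data layout are illustrative choices.

```python
def decode_marginals(nu):
    """Return the labels t_i^* = argmax_a nu_i(a) and the expected number of correct labels.

    nu : list of per-node marginals, nu[i][a] = probability that node i is in group a.
    """
    # Ties are broken in favor of the smallest group index.
    t_star = [max(range(len(row)), key=row.__getitem__) for row in nu]
    # Each node is labeled correctly with probability nu_i(t_i^*).
    expected_correct = sum(row[a] for row, a in zip(nu, t_star))
    return t_star, expected_correct
```

When all marginals are uniform, the expected number of correct labels is N/q, which is no better than random guessing.
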

Belief Propagation. We could estimate the free en- 
ergy using Monte Carlo (MC) sampling, and we do this 
for comparison. But a faster algorithm is Belief Propagation
(BP), known in physics as the cavity method [2, 3].
It is exact in the thermodynamic limit as long as the net- 
work is locally treelike, and as long as connected correla- 
tions decay rapidly as a function of topological distance. 

To derive the BP equations [3, 6], one introduces "messages"
$\psi^{i \to j}_{t_i}$ and $\psi^{j \to i}_{t_j}$ for each pair of nodes $(i,j)$. These
are conditional marginals in the cavity method. For instance,
$\psi^{i \to j}_{t_i}$ is the probability that $i$ would be in group
$t_i$ if $j$ were removed from the network. Assuming conditional
independence between the neighbors of each node
and neglecting lower order terms, the messages must be
a fixed point of a consistency equation,

$$\psi^{i \to j}_{t_i} = \frac{1}{Z^{i \to j}} \, n_{t_i} e^{-h_{t_i}} \prod_{k \in \partial i \setminus j} \left[ \sum_{t_k} c_{t_k t_i} \psi^{k \to i}_{t_k} \right], \qquad (3)$$

for each edge $(i,j)$. Here $\partial i$ is the set of $i$'s neighbors,
the field $h_{t_i} = \frac{1}{N} \sum_k \sum_{t_k} c_{t_k t_i} \nu_k(t_k)$ summarizes the influence
of the non-edges, and $Z^{i \to j}$ is a normalizing factor.

FIG. 1: Learning for $q = 2$ groups with $n_a = 1/2$, average
degree $c = 3$, and $\epsilon = c_{\mathrm{out}}/c_{\mathrm{in}} = 0.15$. If we initialize it in the
ordered region, i.e., with $\epsilon_0 < 0.37$, our algorithm infers the
correct value of $\epsilon$. Inset: the free energy as a function of $\epsilon$.
Note the minimum at $\epsilon = 0.15$, and the paramagnetic region
for $\epsilon > 0.37$.



We start with random messages, and iterate (3) until we
reach a fixed point. This typically takes just a few iterations,
and each step takes linear, $O(N)$, time.
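
The iteration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `belief_propagation`, the random sequential update order, the convergence tolerance, and the dictionary layout of the messages are all assumptions. It expects a simple graph (no self-loops or repeated edges) and keeps the non-edge field up to date incrementally, so that one sweep costs time linear in the number of edges for bounded degrees.

```python
import math
import random

def belief_propagation(N, edges, n, c, max_iter=200, tol=1e-6, seed=0):
    """Iterate the BP equations for the block model; return (marginals nu, messages psi)."""
    rng = random.Random(seed)
    q = len(n)
    nbrs = [[] for _ in range(N)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    # psi[(i, j)][a]: probability that i is in group a if j were removed (random start).
    psi = {(i, j): normalize([rng.random() for _ in range(q)])
           for i in range(N) for j in nbrs[i]}
    nu = [list(n) for _ in range(N)]
    # Non-edge field h[a] = (1/N) sum_k sum_b c[b][a] nu[k][b], evaluated for nu[k] = n.
    h = [sum(c[b][a] * n[b] for b in range(q)) for a in range(q)]

    for _ in range(max_iter):
        diff = 0.0
        for i in rng.sample(range(N), N):  # random sequential update order
            # m[k][a] = sum_b c[b][a] * psi^{k->i}_b for every neighbor k of i.
            m = {k: [sum(c[b][a] * psi[(k, i)][b] for b in range(q)) for a in range(q)]
                 for k in nbrs[i]}
            prior = [n[a] * math.exp(-h[a]) for a in range(q)]
            for j in nbrs[i]:
                # Cavity message i -> j: product over all neighbors of i except j.
                new = normalize([prior[a] * math.prod(m[k][a] for k in nbrs[i] if k != j)
                                 for a in range(q)])
                diff = max(diff, max(abs(x - y) for x, y in zip(new, psi[(i, j)])))
                psi[(i, j)] = new
            # Full marginal of node i: product over all of its neighbors.
            new_nu = normalize([prior[a] * math.prod(m[k][a] for k in nbrs[i])
                                for a in range(q)])
            for a in range(q):
                h[a] += sum(c[b][a] * (new_nu[b] - nu[i][b]) for b in range(q)) / N
            nu[i] = new_nu
        if diff < tol:
            break
    return nu, psi
```

Recomputing each cavity product from scratch costs time quadratic in the degree of a node, which is harmless on sparse graphs and avoids dividing by factors that may vanish when some c_ab is zero.
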

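The marginals corresponding to the BP fixed point are

$$\nu_i(t_i) = \frac{1}{Z^i} \, n_{t_i} e^{-h_{t_i}} \prod_{j \in \partial i} \left[ \sum_{t_j} c_{t_j t_i} \psi^{j \to i}_{t_j} \right],$$

and the free energy is

$$f_{\mathrm{BP}}(\{\theta\}) = -\frac{1}{N} \sum_i \log Z^i + \frac{1}{N} \sum_{(i,j) \in E} \log Z^{ij} - \frac{c}{2},$$

where $E$ is the set of edges, $c$ is the average degree, $Z^i$ is the
normalization of $\nu_i$, and
$Z^{ij} = \sum_{a<b} c_{ab} (\psi^{i \to j}_a \psi^{j \to i}_b + \psi^{i \to j}_b \psi^{j \to i}_a) + \sum_a c_{aa} \psi^{i \to j}_a \psi^{j \to i}_a$.
For more details, see [3], [6]. Requiring
that $f_{\mathrm{BP}}(\{\theta\})$ is stationary we update the parameters to
their most-likely values given the fixed point,

$$c'_{ab} = \frac{1}{N} \frac{1}{n_a n_b} \sum_{(i,j) \in E} \frac{c_{ab} (\psi^{i \to j}_a \psi^{j \to i}_b + \psi^{i \to j}_b \psi^{j \to i}_a)}{Z^{ij}},$$

and $n'_a = \sum_i \nu_i(a)/N$. Starting with a suitable initial
value $\{\theta_0\}$, we compute $\{\theta'\}$ and iterate until convergence
(see Fig. 1), as in the expectation-maximization
algorithm [16]. To learn the number of groups $q$, we run
the algorithm with several values of $q'$. The free energy
$f_{\mathrm{BP}}$ decreases with $q'$ and then stays constant for $q' > q$.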
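
One learning step can be sketched as below, under the assumption that BP has already produced node marginals and messages. The function name `update_parameters` and the data layout (a dictionary of messages keyed by directed pairs) are illustrative; the affinity matrix is assumed symmetric, so that the edge normalization equals the sum of c_ab times the two messages over all ordered group pairs.

```python
def update_parameters(N, edges, nu, psi, n, c):
    """Re-estimate (n_a, c_ab) from BP marginals nu[i][a] and messages psi[(i, j)][a]."""
    q = len(n)
    # n'_a = (1/N) sum_i nu_i(a)
    n_new = [sum(row[a] for row in nu) / N for a in range(q)]
    # pair[a][b] accumulates, over edges, the probability that the endpoints
    # carry the labels (a, b) in either order.
    pair = [[0.0] * q for _ in range(q)]
    for i, j in edges:
        w = [[c[a][b] * psi[(i, j)][a] * psi[(j, i)][b] for b in range(q)]
             for a in range(q)]
        z = sum(map(sum, w))  # edge normalization Z^{ij}
        for a in range(q):
            for b in range(q):
                pair[a][b] += (w[a][b] + w[b][a]) / z
    # c'_ab = pair[a][b] / (N n'_a n'_b); assumes every group keeps nonzero weight.
    c_new = [[pair[a][b] / (N * n_new[a] * n_new[b]) for b in range(q)]
             for a in range(q)]
    return n_new, c_new
```

Alternating this step with a BP run to convergence gives the expectation-maximization-like loop described above.
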

Phase diagrams. For illustration we use the case of
planted partitions and colorings, $n_a = 1/q$, $c_{ab} = c_{\mathrm{out}}$
for $a \neq b$, and $c_{aa} = c_{\mathrm{in}}$. We observe three different
cases governing the free energy landscape $f_{\mathrm{BP}}(\{\theta\})$. In the
"paramagnetic" phase, the free energy is constant in the
vicinity of the true value of $\{\theta\}$. Learning is impossible,
and the marginals are $\nu_i(t_i) = 1/q$ for all nodes. In this
case the overlap between the original assignment and the
one resulting from BP marginalization, defined as



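$$Q(\{t_i\}, \{t_i^*\}) = \max_{\pi \in S_q} \frac{\frac{1}{N} \sum_i \delta_{t_i, \pi(t_i^*)} - \frac{1}{q}}{1 - \frac{1}{q}} \qquad (4)$$

(where $S_q$ is the permutation group) is zero, and the original
assignment is undetectable. Generalizing [11], [17],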
one can show there is essentially no difference between 
a graph produced by the block model and a completely 
random graph of the same average degree; the free energy 
of the two ensembles is asymptotically identical. 
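
As an illustration of how the overlap (4) can be evaluated, the sketch below maximizes the agreement over all relabelings of the groups. It assumes the normalization in which the chance level 1/q is subtracted and the result is rescaled by 1 - 1/q, which is the natural choice for equal-size groups; the function name `overlap` is hypothetical, and the loop over all q! permutations is only practical for small q.

```python
from itertools import permutations

def overlap(t_true, t_est, q):
    """Agreement between two labelings, maximized over permutations of the q labels
    and rescaled so that chance level (1/q) maps to 0 and a perfect match maps to 1."""
    N = len(t_true)
    best = max(sum(a == pi[b] for a, b in zip(t_true, t_est))
               for pi in permutations(range(q)))
    return (best / N - 1.0 / q) / (1.0 - 1.0 / q)
```

The maximization over permutations is needed because the group labels themselves carry no meaning: an assignment that swaps the names of two groups is still a perfect recovery.
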

In the ordered phase, $f_{\mathrm{BP}}$ has an attractive global minimum
at the true value of $\{\theta\}$, and BP rapidly infers
the correct parameters. This is illustrated in Fig. 2.
As $\epsilon = c_{\mathrm{out}}/c_{\mathrm{in}}$ varies from 0 ($q$ completely separated
groups) to 1 (a pure random graph), we observe a continuous
phase transition from an ordered phase with positive
overlap to a paramagnetic phase with zero overlap.
Thus there is a second-order transition from a detectable
to an undetectable phase.

A third situation arises if $f_{\mathrm{BP}}(\{\theta\})$ has both a paramagnetic
fixed point and the ordered fixed point at the
true $\{\theta\}$. In this case, the two phases co-exist and the
detectability transition is first-order; see Fig. 2 on the right.
The phase transition is located by comparing the free
energies of the two phases. However, even if the ordered
fixed point has a lower free energy, it is not easy to find
it unless the initial messages are close to the true group
assignment. All but an exponentially small set of initial
messages will lead to the paramagnetic fixed point. This
situation is typical of mean-field first-order phase transitions.
In fact, recent results about random optimization
problems show that finding the lower-free-energy phase
in this case is an extremely hard problem.

Only when the paramagnetic phase is no longer locally
stable does inference become easy. We can compute the
location of the transition to this easily-detectable phase
analytically by analyzing how a small random perturbation
to the paramagnetic fixed point propagates as the
BP equations are iterated [11], [19]. It follows that for

$$|c_{\mathrm{in}} - c_{\mathrm{out}}| > q \sqrt{c}, \qquad (5)$$

the original group assignment is dynamically attractive
and hence many algorithms, e.g. MC or BP, will converge
to it. Note that it is typically still hard to compute
the ground state of (2), even though we can compute
the marginals, and therefore the optimal estimate of the
group assignment, asymptotically exactly.
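
Condition (5) is easy to evaluate numerically. The sketch below assumes equal-size groups, so that the average degree is c = (c_in + (q - 1) c_out) / q; both function names are illustrative. For the assortative case, solving (5) as an equality for the ratio eps = c_out / c_in gives eps = (sqrt(c) - 1) / (sqrt(c) - 1 + q), which the second function returns.

```python
import math

def is_easily_detectable(c_in, c_out, q):
    """Check condition (5): |c_in - c_out| > q * sqrt(c), with c the average degree."""
    c = (c_in + (q - 1) * c_out) / q  # average degree for equal-size groups
    return abs(c_in - c_out) > q * math.sqrt(c)

def critical_epsilon(c, q):
    """Ratio eps = c_out / c_in at which (5) becomes an equality (assortative case, c > 1)."""
    return (math.sqrt(c) - 1.0) / (math.sqrt(c) - 1.0 + q)

if __name__ == "__main__":
    print(critical_epsilon(3, 2))   # two groups, average degree 3
    print(critical_epsilon(16, 4))  # four groups, average degree 16
```

For q = 2 and c = 3 the expression reduces to 2 - sqrt(3), roughly 0.27, and for q = 4 and c = 16 it equals 3/7.
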

On the other hand, if (5) is not satisfied then community
detection is either impossible, or at best as hard as
solving the hardest known optimization problems. When
$c_{\mathrm{out}} < c_{\mathrm{in}}$ the phase transition is of first-order for $q > 4$,
as can be retrieved from data presented in [19]. However,
the detectable but hard region is so narrow that it
is quite unlikely to appear in realistic situations.

Real-world networks. Our algorithm is not restricted 
to large random networks; it is applicable to real net- 
works as well. We tested it on the "Karate Club" net- 
work [20], a common benchmark for community detec- 
tion. For q = 2, BP leads to two different fixed points. 
One corresponds to the actual known division into two 

FIG. 2: (color online) The best possible overlap between the inferred and original group assignment. Left: community detection
with $q = 2$, $c = 3$ and different values of $\epsilon = c_{\mathrm{out}}/c_{\mathrm{in}}$. A continuous phase transition between a detectable and a non-detectable
phase arises at the critical point given by (5). Middle: The 4-group community detection benchmark of [9] with $c = 16$, with
the same phenomenology. The results agree well with MC simulations, except very close to the critical point where finite-size
effects are stronger. Right: A planted coloring problem with $q = 5$ and $c_{\mathrm{in}} = 0$, $c = c_{\mathrm{out}}(1 - 1/q)$. Both the ordered fixed point
(green +s, obtained by initializing in the actual group assignment) and the paramagnetic one (blue xs, obtained by initializing
the algorithm in a random configuration) exist between $c_d$ and $c_s$. The difference $\Delta f$ (red) between the paramagnetic and
ordered free energies shows that modules are in principle detectable as soon as $c > c_c$, when $\Delta f > 0$. It is in practice impossible
to find the corresponding fixed point, and detection becomes feasible only after the spinodal point $c_s$ given by (5).



groups. The other has a smaller free energy and thus a 
larger likelihood, and splits the network into high-degree 
nodes and low-degree nodes as found in These two 
fixed points correspond to two local minima of $f_{\mathrm{BP}}$ for
$q = 2$, and depending on the initial value $\{\theta_0\}$ BP converges
to one or the other. For $q > 2$, our algorithm
converges to fixed points with yet lower values of $f_{\mathrm{BP}}$.
For q = 4 the best fixed point corresponds to a splitting 
of the two actual groups into high-degree and low-degree 
subgroups. 

The results obtained with MC, which can be easily 
equilibrated for such a small network, are almost identical 
to those of BP in terms of the parameters and marginals, 
and identical in terms of the estimated group assign- 
ments. This demonstrates that BP is a useful approach 
even on real, finite networks that are far from trees. 

Conclusion. We have presented a principled and 
asymptotically exact analysis of the detection of com- 
munities in networks generated by the stochastic block 
model. There is a strict limit on detectability due to 
a transition from a phase where the free energy land- 
scape lets us infer the model's parameters, to a phase 
where it does not. In some cases the communities are 
detectable, but the problem is hard because the attrac- 
tive region around the correct fixed point is exponentially 
small. Our analysis comes with an associated learning al- 
gorithm, which for large sparse networks generated from 
the model is able to learn the number of groups, their exact
sizes, and the affinity matrix $p_{ab}$. Our approach and
our algorithm are easily generalized to other local gener- 
ative models, and we will investigate its performance on 
a variety of real-world networks in the future. 